This notebook explores loan data from Prosper, an online lending platform that connects people who want to borrow money with individual and institutional lenders. The data includes loan characteristics, Prosper's internal tracking data, borrower profiles and some lender information.
The analysis will focus on
The first question is of interest to potential lenders, who may not be familiar with the loan pricing mechanism, and also to auditors and competitors. The second question is a proxy for how a return estimate relates to actual returns. It is only a proxy because losses are just one part of return (the other part being yield and other fees).
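To make the proxy idea concrete, here is a minimal sketch on made-up numbers. It assumes that return can be approximated as effective yield minus expected loss; the toy values and the `implied_return` column name are illustrative and do not come from the Prosper data.

```python
import pandas as pd

# Toy values for illustration only, not taken from the Prosper dataset.
toy = pd.DataFrame({
    "EstimatedEffectiveYield": [0.20, 0.12, 0.30],
    "EstimatedLoss": [0.08, 0.03, 0.15],
})

# Assumption: return is approximated as yield minus expected loss.
toy["implied_return"] = toy["EstimatedEffectiveYield"] - toy["EstimatedLoss"]
print(toy)
```

Under this assumption, two loans with the same yield can have very different returns if their losses differ, which is why losses alone only partly describe return.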
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="white")
# load dataset
df = pd.read_csv('prosperLoanData.csv')
df.head()
df.shape
df.info()
#converting all dates to datetime
df.ListingCreationDate = pd.to_datetime(df.ListingCreationDate)
df.LoanOriginationDate = pd.to_datetime(df.LoanOriginationDate)
df.DateCreditPulled = pd.to_datetime(df.DateCreditPulled)
df.ClosedDate = pd.to_datetime(df.ClosedDate)
# LoanStatus into ordered categorical type
loanstatus_order = ['Current', 'FinalPaymentInProgress', 'Completed', 'Past Due (1-15 days)',
'Past Due (16-30 days)', 'Past Due (31-60 days)', 'Past Due (61-90 days)',
'Past Due (91-120 days)', 'Past Due (>120 days)', 'Defaulted',
'Chargedoff']
ordered_var = pd.api.types.CategoricalDtype(ordered = True, categories = loanstatus_order)
df.LoanStatus = df.LoanStatus.astype(ordered_var)
The variable dictionary mentions a change in loan data applicable to loans originated after July 2009; this affects the calculation of Effective Yield, Estimated Loss and Estimated Return, the key variables analysed. Therefore, I will drop the loan listings that occur before July 2009. Before doing this, I will check the distribution of loan listings across time, to ensure there will be sufficient remaining data for the proposed analysis.
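A quick numeric version of this check could count listings on each side of a cutoff date. The sketch below uses toy dates and a hypothetical helper, `count_by_cutoff`, which is not part of the original analysis.

```python
import pandas as pd

def count_by_cutoff(dates, cutoff):
    """Return (n_before, n_on_or_after) for a datetime Series and a cutoff date."""
    cutoff = pd.Timestamp(cutoff)
    before = int((dates < cutoff).sum())
    after = int((dates >= cutoff).sum())
    return before, after

# Toy listing dates, for illustration only.
toy_dates = pd.to_datetime(
    pd.Series(["2008-05-01", "2009-06-30", "2009-07-13", "2012-01-15"])
)
counts = count_by_cutoff(toy_dates, "2009-07-01")
print(counts)
```

On the real data the same helper could be called as `count_by_cutoff(df.ListingCreationDate, "2009-07-01")`, assuming the column has already been converted to datetime.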
plt.figure(figsize=[15,5])
df.groupby([df.ListingCreationDate.dt.year,df.ListingCreationDate.dt.month])['ListingCreationDate'].count().plot(kind='bar')
plt.xlabel('Listing creation period (year, month)');
There appears to be sufficient data for analysis in the post-July 2009 period. I will drop earlier observations, using the date on which EstimatedReturn is first recorded as the cutoff.
min_date = min(df[~df.EstimatedReturn.isnull()].ListingCreationDate)
min_date
df = df[df.ListingCreationDate>= min_date]
df.shape
As the dataset contains 81 variables, I will create a correlation matrix of the numeric variables to make a preliminary decision about which variables are likely to relate to the variables of interest.
This will not be a conclusive analysis because, among other reasons:
#lets check correlations of all numeric variables
plt.figure(figsize=[20,20])
#compute correlations
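# numeric_only=True (pandas >= 1.5) skips text and date columns explicitly
corr = df.corr(numeric_only=True)
# Generate a mask for the upper triangle
# (np.bool was removed from recent NumPy releases; the builtin bool is equivalent)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True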
# Plot heatmap
sns.heatmap(corr, cmap = 'RdBu', vmin = -1, vmax = 1, mask=mask, linewidths=0.5)
plt.title('Correlation table (numerical variables)');
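A heatmap of this size is hard to read cell by cell, so one option is to rank columns by their absolute correlation with a single target. The sketch below runs on toy data; `rank_correlations` is a hypothetical helper, and it assumes a pandas version where `DataFrame.corr` accepts `numeric_only` (1.5 or later).

```python
import numpy as np
import pandas as pd

def rank_correlations(frame, target, top=10):
    """Return the `top` numeric columns most correlated (in absolute value) with `target`."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top)

# Toy data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "target": x,
    "strong_negative": -x + rng.normal(scale=0.1, size=200),
    "weak": rng.normal(size=200),
})
ranked = rank_correlations(toy, "target")
print(ranked)
```

On the loan data, a call such as `rank_correlations(df, "EstimatedReturn")` would list candidate variables while keeping the sign of each coefficient visible.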
There are 84,853 observations in the dataset (abbreviated to include only observations from 13 July 2009 onward), where 13 July 2009 marks the introduction of EstimatedReturn, EstimatedLoss, EstimatedEffectiveYield, ProsperScore and other variables that are key to the analysis.
The dataset contains 81 features: 61 numerical, 17 categorical and 3 boolean. Some of the numerical variables are in fact numeric categories or ratings (such as 'ProsperRating (numeric)', ListingCategory, ProsperScore).
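A breakdown of columns by type like the one above can be reproduced programmatically. The following is a minimal sketch on a toy frame; `summarize_dtypes` is a hypothetical helper, and it groups everything that is neither numeric nor boolean (strings, dates) under "other".

```python
import pandas as pd

def summarize_dtypes(frame):
    """Count columns per broad type: numeric, boolean, other (e.g. strings, dates)."""
    n_bool = frame.select_dtypes(include="bool").shape[1]
    n_num = frame.select_dtypes(include="number").shape[1]  # excludes bool columns
    n_other = frame.shape[1] - n_bool - n_num
    return {"numeric": n_num, "boolean": n_bool, "other": n_other}

# Toy frame for illustration only.
toy = pd.DataFrame({
    "amount": [1000, 2500],
    "rate": [0.12, 0.30],
    "is_homeowner": [True, False],
    "status": ["Current", "Completed"],
})
summary = summarize_dtypes(toy)
print(summary)
```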
The following are numerical variables with higher absolute correlation coefficients with estimated lender return, categorised:
A few variables are highly correlated, such as:
Additional variables (non numerical) to be included in analysis:
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [12,3])
variables = ['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss']
for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var], bins=50)
    ax[j].set_xlabel('{}'.format(var))
df[df.EstimatedReturn < 0]['ListingKey'].count(), df[df.EstimatedEffectiveYield < 0]['ListingKey'].count()
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [12,3])
variables = ['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss']
for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var][df[var]>0], bins=50)
    ax[j].set_xlabel('{}'.format(var))
print(df.LP_NetPrincipalLoss[df.LP_NetPrincipalLoss == 0].count())
print(df.LP_NetPrincipalLoss[df.LP_NetPrincipalLoss != 0].count())
columns = 2
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])
variables = ['LoanOriginalAmount', 'LP_NetPrincipalLoss']
for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var][df[var]>0], bins=20)
    ax[j].set_xlabel('{}'.format(var));
df['LoanLossRatio'] = df.LP_NetPrincipalLoss/df.LoanOriginalAmount
plt.figure(figsize = [6,4])
plt.hist(df.LoanLossRatio[(df.LoanLossRatio > 0)], bins=100);
columns = 5
fig, ax = plt.subplots(ncols=columns, figsize = [16,3])
variables = ['InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years',
'BankcardUtilization', 'AvailableBankcardCredit']
for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var], bins=50)
    ax[j].set_xlabel('{}'.format(var))
columns = 5
rows = 2
fig, ax = plt.subplots(nrows=rows, ncols=columns, figsize = [columns*4,rows*3.5])
variables = ['InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years',
'BankcardUtilization', 'AvailableBankcardCredit']
for i in range(rows):
    for j in range(columns):
        var = variables[j]
        # first row: linear bins
        step = max(df[var])/100
        bins = np.arange(0,max(df[var])+step,step)
        if i==1:
            # second row: logarithmic bins and a log-scaled x axis
            step = np.log10(df[var].max())/50
            bins = 10 ** np.arange(0, np.log10(df[var].max())+step, step)
            ax[i, j].set_xscale('log')
        ax[i,j].hist(df[var][(df[var] > 0)], bins=bins)
        ax[i, j].set_xlabel('{}'.format(var))
columns = 2
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])
variables = ['CreditScoreRangeLower', 'CreditScoreRangeUpper']
for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var], bins=20)
    ax[j].set_xlabel('{}'.format(var))
sns.scatterplot(data=df, x='CreditScoreRangeLower', y='CreditScoreRangeUpper', alpha = 0.5)
plt.title("Credit Score Lower vs Upper Range Barrier")
plt.xlabel("Lower Barrier")
plt.ylabel("Upper Barrier");
df['CreditScoreMid'] = (df.CreditScoreRangeUpper+df.CreditScoreRangeLower)/2
'DebtToIncomeRatio', 'StatedMonthlyIncome', 'MonthlyLoanPayment', 'IsBorrowerHomeowner'
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
bins = np.arange(-1,max(df.DebtToIncomeRatio),0.5)
ax[0].hist(df.DebtToIncomeRatio, bins=bins)
bins = np.arange(0,2,0.1)
ax[1].hist(df.DebtToIncomeRatio, bins=bins);
plt.figure(figsize = [6,4])
bins = np.arange(-1,max(df.StatedMonthlyIncome),10000)
plt.hist(df.StatedMonthlyIncome, bins=bins);
df.StatedMonthlyIncome[df.StatedMonthlyIncome > 50000].count()
df.StatedMonthlyIncome.describe()
bins = np.arange(-1,50000,1000)
plt.hist(df.StatedMonthlyIncome, bins=bins);
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
ax[0].hist(df.MonthlyLoanPayment, bins=100)
bins = np.arange(0,1500, 50)
ax[1].hist(df.MonthlyLoanPayment, bins=bins)
ax[1].set_xticks(bins)
ax[1].set_xticklabels(bins,rotation = 90);
df.IsBorrowerHomeowner.value_counts().plot(kind='barh', figsize=[6,4]);
df.Occupation.value_counts().head(50).plot(kind='bar', figsize = [12,4]);
('ListingCategory (numeric)', 'InvestmentFromFriendsCount')
df.LoanStatus.value_counts().sort_index().plot(kind='bar', figsize = [6,4]);
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
ax[0].hist(df.LoanOriginalAmount, bins=100)
bins = np.arange(0,35000,1000)
ax[1].hist(df.LoanOriginalAmount, bins=bins);
df.Term.value_counts().sort_index().plot(kind='bar', figsize=[6,4]);
df['ListingCategory (numeric)'].value_counts().sort_index().plot(kind='bar', figsize = [6,4]);
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
df['LoanStatus'].value_counts().sort_index().plot(kind='bar', ax=ax[0])
df['LoanStatus'].value_counts().sort_index()[3:].plot(kind='bar', ax=ax[1]);
'ListingCreationDate', 'ClosedDate', 'LoanOriginationDate'
fig, ax = plt.subplots(nrows=3, figsize=[12,12])
df.groupby([df.ListingCreationDate.dt.year,df.ListingCreationDate.dt.month])['ListingCreationDate'].count().plot(kind='bar',
ax=ax[0])
ax[0].set_xlabel('Listing creation period (year, month)')
df.groupby([df.LoanOriginationDate.dt.year,df.LoanOriginationDate.dt.month])['LoanOriginationDate'].count().plot(kind='bar',
ax=ax[1])
ax[1].set_xlabel('Loan origination period (year, month)')
df.groupby([df.ClosedDate.dt.year,df.ClosedDate.dt.month])['ClosedDate'].count().plot(kind='bar',
ax=ax[2])
ax[2].set_xlabel('Listing closed period (year, month)')
plt.subplots_adjust(hspace=0.7);
warehouse_range = (df.LoanOriginationDate - df.ListingCreationDate).dt.days
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))
ax[0].hist(warehouse_range, bins=100);
step = 0.25
bins = 10**np.arange(0, np.log10(warehouse_range.max())+step, step)
ax[1].hist(warehouse_range, bins=bins)
ax[1].set_xscale('log')
plt.xticks([1,3,10,100,300], [1,3,10,100,300]);
warehouse_range.describe()
print(df.InvestmentFromFriendsAmount[df.InvestmentFromFriendsAmount == 0].count())
print(df.InvestmentFromFriendsAmount[df.InvestmentFromFriendsAmount > 0].count())
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])
df['FriendInvestmentRatio'] = df.InvestmentFromFriendsAmount / df.LoanOriginalAmount
variables = ['InvestmentFromFriendsAmount', 'LoanOriginalAmount', 'FriendInvestmentRatio']
for j in range(columns):
var = variables[j]
ax[j].hist(df[var][df[var]>0], bins=20)
ax[j].set_xlabel('{}'.format(var))
In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).
df = df[['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss',
'LP_NetPrincipalLoss', 'LoanOriginalAmount', 'LoanLossRatio',
'InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years',
'BankcardUtilization', 'AvailableBankcardCredit',
'CreditScoreMid',
'DebtToIncomeRatio', 'StatedMonthlyIncome', 'MonthlyLoanPayment', 'IsBorrowerHomeowner', 'Occupation',
'Term', 'ListingCategory (numeric)', 'LoanStatus',
'ListingCreationDate', 'LoanOriginationDate', 'ClosedDate',
'InvestmentFromFriendsAmount', 'FriendInvestmentRatio']]
df.columns
df.info()
#lets check correlations of all numeric variables
plt.figure(figsize=[20,20])
#compute correlations
# numeric_only=True (pandas >= 1.5) skips the categorical, text and date columns
corr = df.corr(numeric_only=True)
# Plot heatmap
sns.heatmap(corr, cmap = "RdBu", vmin = -1, vmax = 1, linewidths=0.5, annot = True, fmt = '.2f')
plt.title('Correlation table (numerical variables)');
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.regplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield,
fit_reg = True, x_jitter= 0.3, scatter_kws= {'alpha' : 1/50}, ax = ax[0])
sns.regplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss,
fit_reg = True, x_jitter= 0.3, scatter_kws= {'alpha' : 1/50}, ax = ax[1]);
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield, ci = 'sd',ax = ax[0])
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss, ci = 'sd', ax = ax[1]);
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.scatterplot(data=df, x='EstimatedReturn', y='EstimatedEffectiveYield',
hue = df.ListingCreationDate.dt.year, alpha = 0.5, ax=ax[0], legend='full')
ax[0].legend(loc='upper left')
sns.scatterplot(data=df[(df.EstimatedReturn!=df.EstimatedEffectiveYield)&(df.EstimatedReturn>0)],
x='EstimatedReturn', y='EstimatedEffectiveYield', hue = df.ListingCreationDate.dt.year,
alpha = 0.5, ax=ax[1], legend='full')
ax[1].legend(loc='upper left');
df[df.EstimatedReturn!=df.EstimatedEffectiveYield].ListingCreationDate.dt.year.value_counts()
df[df.EstimatedReturn==df.EstimatedEffectiveYield].ListingCreationDate.dt.year.value_counts()
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.scatterplot(data=df, x='EstimatedEffectiveYield', y='EstimatedLoss',
hue = df.ListingCreationDate.dt.year, alpha = 0.5, ax=ax[0], legend='full')
ax[0].legend(loc='upper left')
sns.scatterplot(data=df[(df.EstimatedReturn!=df.EstimatedEffectiveYield)&(df.EstimatedReturn>0)],
x='EstimatedEffectiveYield', y='EstimatedLoss', hue = df.ListingCreationDate.dt.year,
alpha = 0.5, ax=ax[1], legend='full')
ax[1].legend(loc='upper left');
df[df.EstimatedReturn!=df.EstimatedEffectiveYield][['EstimatedReturn','EstimatedEffectiveYield','EstimatedLoss']].corr()
fig, ax = plt.subplots(ncols=2,figsize=[16,5])
del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df.EstimatedEffectiveYield, ci = 'sd',
linestyles='--', ax=ax[0])
ax[0].set_xticklabels(labels=del_bins,rotation = 90)
inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df.EstimatedEffectiveYield, ci = 'sd',
linestyles='--', ax=ax[1])
ax[1].set_xticklabels(labels=inq_bins,rotation = 90);
status_list = ['Current', 'Completed', 'Defaulted', 'Chargedoff']
df_small = df[df.LoanStatus.isin(status_list)]
fig, ax = plt.subplots(ncols=2,figsize=[18,5])
colors = ['pure blue', 'light grey', 'baby blue', 'light grey', 'light grey',
'light grey', 'light grey', 'light grey', 'light grey',
'lipstick', 'rosa']
palette = sns.xkcd_palette(colors)
del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df_small.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df_small.EstimatedEffectiveYield, hue=df_small.LoanStatus,
ci = 'sd', linestyles='', ax=ax[0], dodge=True, palette=palette)
ax[0].set_xticklabels(labels=del_bins,rotation = 90)
ax[0].legend(ncol=3,loc='lower left', fontsize=8)
inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df_small.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df_small.EstimatedEffectiveYield, hue=df_small.LoanStatus,
ci = 'sd', linestyles='', ax=ax[1], dodge=True, palette=palette)
ax[1].set_xticklabels(labels=inq_bins,rotation = 90)
ax[1].legend(ncol=3,loc='lower left', fontsize=8);
plt.figure(figsize=[12,5])
df_postitive_yield=df[df.EstimatedEffectiveYield>0]
sns.scatterplot(x=df_postitive_yield.AvailableBankcardCredit, y=df_postitive_yield.EstimatedEffectiveYield,
alpha = 0.03, color='steelblue')
plt.xscale('log')
plt.xticks([1,100,1000,10000,30000, 100000,1000000],[1,100,'1k','10k','30k','100k','1m']);
sns.set(style='whitegrid')
plt.figure(figsize=[12,5])
AvailableBankCredit_bins = pd.cut(df_postitive_yield.AvailableBankcardCredit, [100,1000,10000,30000, 100000,1000000],
include_lowest = True)
sns.violinplot(x=AvailableBankCredit_bins, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue',
inner='quartile');
plt.figure(figsize=[12,5])
sns.boxplot(x=df_postitive_yield.CreditScoreMid, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue');
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
df_temp = df_postitive_yield[(df_postitive_yield.DebtToIncomeRatio>0)&(df_postitive_yield.DebtToIncomeRatio<1)]
sns.scatterplot(x=df_temp.DebtToIncomeRatio, y=df_temp.EstimatedEffectiveYield,
alpha = 0.03, color='steelblue', ax=ax[0])
DebtToIncome_bins = pd.cut(df_temp.DebtToIncomeRatio, [0,0.2,0.3,0.4,0.5,0.75,1],
include_lowest = True)
sns.violinplot(x=DebtToIncome_bins, y=df_temp.EstimatedEffectiveYield, color='steelblue',
inner='quartile');
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
df_temp = df_postitive_yield[(df_postitive_yield.StatedMonthlyIncome>0)&(df_postitive_yield.StatedMonthlyIncome<40000)]
sns.scatterplot(x=df_temp.StatedMonthlyIncome, y=df_temp.EstimatedEffectiveYield,
alpha = 0.03, color='steelblue', ax=ax[0])
StatedMonthlyIncome_bins = pd.cut(df_temp.StatedMonthlyIncome, [0,2500,5000,7500,10000,20000,40000],
include_lowest = True)
sns.violinplot(x=StatedMonthlyIncome_bins, y=df_temp.EstimatedEffectiveYield, color='steelblue',
inner='quartile')
plt.xticks(rotation=45);
plt.figure(figsize=[6,5])
sns.boxplot(x=df_postitive_yield.IsBorrowerHomeowner, y=df_postitive_yield.EstimatedEffectiveYield,
color='steelblue');
fig, ax=plt.subplots(ncols=2, figsize=[16,5])
occupation_lowyield = df_postitive_yield.groupby('Occupation')['EstimatedEffectiveYield'].mean().sort_values().head(10).index
occupation_highyield = df_postitive_yield.groupby('Occupation')['EstimatedEffectiveYield'].mean().sort_values().tail(10).index
df_temp_low = df_postitive_yield[df_postitive_yield.Occupation.isin(occupation_lowyield)]
df_temp_high = df_postitive_yield[df_postitive_yield.Occupation.isin(occupation_highyield)]
sns.boxplot(x=df_temp_low.Occupation, y=df_temp_low.EstimatedEffectiveYield,
color='steelblue', order=occupation_lowyield, ax=ax[0])
ax[0].set_xticklabels(occupation_lowyield, rotation=90)
sns.boxplot(x=df_temp_high.Occupation, y=df_temp_high.EstimatedEffectiveYield,
color='salmon', order=occupation_highyield, ax=ax[1])
ax[1].set_xticklabels(occupation_highyield, rotation=90);
fig, ax=plt.subplots(ncols=2, figsize=[16,5])
mask_low = df_temp_low.StatedMonthlyIncome<30000
mask_high = df_temp_high.StatedMonthlyIncome<30000
sns.boxplot(x=df_temp_low[mask_low].Occupation, y=df_temp_low[mask_low].StatedMonthlyIncome,
color='steelblue', order=occupation_lowyield, ax=ax[0])
ax[0].set_xticklabels(occupation_lowyield, rotation=90)
sns.boxplot(x=df_temp_high[mask_high].Occupation, y=df_temp_high[mask_high].StatedMonthlyIncome,
color='salmon', order=occupation_highyield, ax=ax[1])
ax[1].set_xticklabels(occupation_highyield, rotation=90);
sns.set(style='white')
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.scatterplot(x=df_postitive_yield.LoanOriginalAmount, y=df_postitive_yield.EstimatedEffectiveYield,
alpha = 0.05, color='steelblue', ax=ax[0])
OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, [0,5100,10100,15100,
20100,25100,30100,35100], include_lowest = True)
sns.boxplot(x=OriginalLoanAmount_bins, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue', ax=ax[1])
plt.xticks(rotation=45);
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
loan_bins = [0,5100,10100,15100,20100,25100,30100,35100]
creditscore_bins = np.arange(601,901,25)
h = ax[0].hist2d(x=df_postitive_yield.LoanOriginalAmount, y=df_postitive_yield.CreditScoreMid,
bins=[loan_bins,creditscore_bins], cmap="Blues", cmin = 100)
plt.colorbar(h[3], ax=ax[0])
OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, bins=loan_bins, include_lowest = True)
sns.boxplot(x=OriginalLoanAmount_bins, y=df_postitive_yield.CreditScoreMid, color='steelblue', ax=ax[1])
plt.xticks(rotation=45);
sns.set(style='whitegrid')
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
sns.boxplot(x=df.LoanStatus, y=df.EstimatedLoss, color='steelblue', ax=ax[0])
sns.boxplot(x=df.LoanStatus, y=df.LoanLossRatio, color='steelblue', ax=ax[1])
# rotate the tick labels on both panels (loop variable renamed so it does not shadow `ax`)
for axis in fig.axes:
    plt.sca(axis)
    plt.xticks(rotation=90);
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
cred_bins = np.arange(601,901,50)
creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index
fig, ax = plt.subplots(ncols=2, figsize=[16,5])
palette = sns.color_palette("Blues")
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield,
hue = creditscore_bins,
ax = ax[0], palette=palette)
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss,
hue = creditscore_bins,
ax = ax[1], palette=palette);
fig, ax = plt.subplots(ncols=2,figsize=[16,5])
palette = sns.color_palette("Blues")
del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df.EstimatedEffectiveYield,
hue = df.ListingCreationDate.dt.year,
ax=ax[0], palette=palette)
ax[0].set_xticklabels(labels=del_bins,rotation = 90)
inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df.EstimatedEffectiveYield,
hue = df.ListingCreationDate.dt.year,
ax=ax[1], palette=palette)
ax[1].set_xticklabels(labels=inq_bins,rotation = 90);
del_bins = np.arange(0, 100 + 5, 5)
inq_bins = np.arange(0, 20 + 2, 2)
cred_bins = np.arange(601,901,50)
creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index
fig, ax = plt.subplots(nrows =len(creditscore_index), ncols=2,figsize=[16,30])
palette = sns.color_palette("Blues")
row = 0
for credit_score in creditscore_index:
    df_credit = df[creditscore_bins == credit_score]
    delinq7yr_bins = pd.cut(df_credit.DelinquenciesLast7Years, del_bins, include_lowest = True)
    sns.pointplot(x = delinq7yr_bins, y = df_credit.EstimatedEffectiveYield, hue = df_credit.ListingCreationDate.dt.year,
                  ax=ax[row,0], palette=palette, ci=None)
    ax[row,0].set_xticklabels(labels=del_bins,rotation = 45)
    ax[row,0].title.set_text('Credit score ={}'.format(credit_score))
    inquiries_bins = pd.cut(df_credit.InquiriesLast6Months, inq_bins, include_lowest = True)
    sns.pointplot(x = inquiries_bins, y = df_credit.EstimatedEffectiveYield, hue = df_credit.ListingCreationDate.dt.year,
                  ax=ax[row,1], palette=palette, ci=None)
    ax[row,1].set_xticklabels(labels=inq_bins,rotation = 45)
    ax[row,1].title.set_text('Credit score ={}'.format(credit_score))
    row +=1
plt.subplots_adjust(hspace=0.3);
fig, ax = plt.subplots(ncols=2,figsize=[16,5])
palette = sns.cubehelix_palette(8)
ABC_bins = [100,1000,10000,30000, 100000,1000000]
AvailableBankCredit_bins = pd.cut(df.AvailableBankcardCredit, ABC_bins,
include_lowest = True)
sns.pointplot(x = AvailableBankCredit_bins, y = df.EstimatedEffectiveYield,
hue = df.ListingCreationDate.dt.year,
ax=ax[0], palette=palette)
ax[0].set_xticklabels(labels=ABC_bins,rotation = 90)
DTI_bins = [0,0.2,0.3,0.4,0.5,0.75,1]
DebtToIncome_bins = pd.cut(df.DebtToIncomeRatio, DTI_bins,
include_lowest = True)
sns.pointplot(x = DebtToIncome_bins, y = df.EstimatedEffectiveYield,
hue = df.ListingCreationDate.dt.year,
ax=ax[1], palette=palette)
ax[1].set_xticklabels(labels=DTI_bins,rotation = 90);
ABC_bins = [100,1000,10000,30000, 100000,1000000]
DTI_bins = [0,0.2,0.3,0.4,0.5,0.75,1]
cred_bins = np.arange(601,901,50)
creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index
fig, ax = plt.subplots(nrows =len(creditscore_index), ncols=2,figsize=[16,30])
palette = sns.cubehelix_palette(8)
row = 0
for credit_score in creditscore_index:
    df_credit = df[(creditscore_bins == credit_score)]
    AvailableBankCredit_bins = pd.cut(df_credit.AvailableBankcardCredit, ABC_bins, include_lowest = True)
    sns.pointplot(x = AvailableBankCredit_bins, y = df_credit.EstimatedEffectiveYield,
                  hue = df_credit.ListingCreationDate.dt.year,
                  ax=ax[row,0], palette=palette, ci=False)
    ax[row,0].set_xticklabels(labels=ABC_bins,rotation = 45)
    ax[row,0].title.set_text('Credit score ={}'.format(credit_score))
    DebtToIncome_bins = pd.cut(df_credit.DebtToIncomeRatio, DTI_bins, include_lowest = True)
    sns.pointplot(x = DebtToIncome_bins, y = df_credit.EstimatedEffectiveYield,
                  hue = df_credit.ListingCreationDate.dt.year,
                  ax=ax[row,1], palette=palette, ci=False)
    ax[row,1].set_xticklabels(labels=DTI_bins,rotation = 45)
    ax[row,1].title.set_text('Credit score ={}'.format(credit_score))
    row +=1
plt.subplots_adjust(hspace=0.3);
sns.set(style='white')
years=df_postitive_yield.LoanOriginationDate.dt.year.value_counts().sort_index().index
loan_bins = [0,5100,10100,15100,20100,25100,35100]
creditscore_bins = np.arange(601,901,25)
creditscore_largebins = [600.001,701,901]
OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, bins=loan_bins, include_lowest = True)
OriginalLoanAmount_index=OriginalLoanAmount_bins.value_counts().sort_index().index
rows = len(OriginalLoanAmount_index)
fig, ax = plt.subplots(nrows = rows, ncols=2, figsize=[16, rows*5])
row = 0
for loan_amount in OriginalLoanAmount_index:
    mask = (OriginalLoanAmount_bins == loan_amount)
    h = ax[row,0].hist2d(x=df_postitive_yield[mask].LoanOriginationDate.dt.year,
                         y=df_postitive_yield[mask].CreditScoreMid,
                         bins=[years,creditscore_bins], cmap="Blues", cmin = 50)
    plt.colorbar(h[3], ax=ax[row,0])
    ax[row,0].title.set_text('Loan amount range ={}'.format(loan_amount))
    CreditScore_large_bins = pd.cut(df_postitive_yield[mask].CreditScoreMid, bins=creditscore_largebins,
                                    include_lowest = True)
    sns.boxplot(x=df_postitive_yield[mask].LoanOriginationDate.dt.year,
                y=df_postitive_yield[mask].EstimatedEffectiveYield,
                hue=CreditScore_large_bins,
                palette='Blues', ax=ax[row,1])
    ax[row,1].title.set_text('Loan amount range ={}'.format(loan_amount))
    row +=1
plt.subplots_adjust(hspace=0.3);